BUSCOMP V0.11.0: run Wed Mar 3 13:51:06 2021
See the run details appendix at the end of this document for details of the log file, commandline parameters and runtime BUSCOMP errors and warnings.
NOTE: To edit this document, open yeast.N3L20ID0U.full.Rmd in RStudio, edit and re-knit to HTML.
Assemblies can be assessed on a number of criteria, but the main ones (in the absence of a reference “truth” genome) are either to judge contiguity or completeness. NG50 and LG50 values are based on a genome size of 13.1 Mb. If the genomesize=X parameter was not set (see command list in appendix), this will be based on the longest assembly (see sequence stats, below).
Of the 4 assemblies analysed (4 BUSCO; 4 fasta; 4 both), 3 genomes were rated as the “best” by at least one criterion:
* PacBioHQ: NG50Length, LG50Count, MaxLength, Complete, Missing, BUSCO.
* PacBioWTDBG2: Complete, Missing.
* SGD: LG50Count, Complete, Missing, NoBUSCO.
Best assemblies by assembly contiguity criteria:
* NG50Length: PacBioHQ
* LG50Count: PacBioHQ, SGD
* MaxLength: PacBioHQ
Best assemblies by completeness criteria:
* Complete: PacBioHQ, PacBioWTDBG2, SGD
* Missing: PacBioHQ, PacBioWTDBG2, SGD
* BUSCO: PacBioHQ
* NoBUSCO: SGD
The following genomes and BUSCO results were analysed by BUSCOMP:
* SGD [BUSCO|Fasta]: SGD R64.2.1 reference genome (strain S288c)
* PacBioHQ [BUSCO|Fasta]: High quality PacBio assembly of strain MBG344 (similar to S288c)
* chrIIIdup [BUSCO|Fasta]: Duplicated chromosome III contig from High Quality PacBio assembly
* PacBioWTDBG2 [BUSCO|Fasta]: WTDBG2 PacBio assembly of strain MBG344 (similar to S288c)
Details of the directories and files are below:
| Directory | Prefix | Genome | Fasta | Sequences |
|---|---|---|---|---|
| ../busco3/run_SGDR64.2.1 | SGDR64.2.1 | SGD | ../fasta/SGDR64.2.1.fsa | True |
| ../busco3/run_MBG344001 | MBG344001 | PacBioHQ | ../fasta/MBG344001.fsa | True |
| ../busco3/run_chrIIIdup | chrIIIdup | chrIIIdup | ../fasta/chrIIIdup.fsa | True |
| ../busco3/run_MBG344WTDBG2 | MBG344WTDBG2 | PacBioWTDBG2 | ../fasta/MBG344WTDBG2.fsa | True |
Genomes with a Directory listed had BUSCO results available. If Sequences is True, these will have been compiled to generate the BUSCOMP sequence set (unless buscompseq=F, or alternative sequences were provided with buscofas=FASFILE). Genomes with a Fasta listed had sequence data available for BUSCOMP searches.
The following genome statistics were also calculated by RJE_SeqList for each genome (table, below):
* `CtgNum`: Number of contigs (`SeqNum`+`GapCount`).
* `NG50Length`: Calculated using `genomesize=X`. If no genome size is given, it will be relative to the biggest assembly.
* `LG50Count`: Calculated using `genomesize=X`. If no genome size is given, it will be relative to the biggest assembly.
* `GapLength`: Number of gap (`N`) nucleotides in the assembly.
* `GapCount`: Number of gap (`N`) regions in the assembly.
| Genome | Description | SeqNum | TotLength | MinLength | MaxLength | MeanLength | MedLength | N50Length | L50Count | CtgNum | N50Ctg | L50Ctg | NG50Length | LG50Count | GapLength | GapCount | GC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SGD | SGD R64.2.1 reference genome (strain S288c) | 18 | 12163423 | 6318 | 1531933 | 675745.7 | 706283.5 | 924431 | 6 | 18 | 924431 | 6 | 924431 | 6 | 0 | 0 | 38.15 |
| PacBioHQ | High quality PacBio assembly of strain MBG344 (similar to S288c) | 17 | 12275942 | 85779 | 1553502 | 722114.2 | 746591.0 | 930848 | 6 | 17 | 930848 | 6 | 930848 | 6 | 0 | 0 | 38.17 |
| chrIIIdup | Duplicated chromosome III contig from High Quality PacBio assembly | 2 | 695514 | 347757 | 347757 | 347757.0 | 347757.0 | 347757 | 1 | 2 | 347757 | 1 | 0 | -1 | 0 | 0 | 38.52 |
| PacBioWTDBG2 | WTDBG2 PacBio assembly of strain MBG344 (similar to S288c) | 29 | 12109398 | 4007 | 1524521 | 417565.4 | 344232.0 | 764884 | 6 | 29 | 764884 | 6 | 755429 | 7 | 0 | 0 | 37.85 |
NOTE: NG50Length and LG50Count statistics use genomesize=X or the biggest assembly loaded (13.10 Mb). If BUSCOMP has been run more than once on the same data (e.g. to update descriptions or sorting), please make sure that a consistent genome size is used, or these values may be wrong. If in doubt, run with force=T and force regeneration of statistics.
In general, a good assembly will be approx. the same size as the genome and in as few pieces as possible. Any assembly smaller than the predicted genome size is clearly missing coverage. Assemblies bigger than the genome size might still be missing chunks of the genome if redundancy/duplication is a problem. In the following plot, the grey line marks the given genome size of 13.1 Mb.
A better indicator of the overall coverage of the genome is the number of Missing BUSCO genes. As BUSCO is highly dependent on the accuracy of the sequence and the gene models it makes, the Missing BUSCOMP ratings arguably give a more consistent proxy for genome completeness. NOTE: this says nothing about the fragmentation or completeness of the genes themselves.
In general, a good assembly will be in fewer, bigger pieces. This is approximated using NG50 and LG50, which are the min. length and number of contigs/scaffolds required to cover at least half the genome. These stats use the given genome size of 13.1 Mb.
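The NG50/LG50 definition above can be illustrated with a minimal Python sketch. This is not BUSCOMP's own code: the function name `ng50_lg50` is hypothetical, and the `(0, -1)` return for assemblies covering less than half the genome is an assumption that simply mirrors the values reported for chrIIIdup in the statistics table.

```python
def ng50_lg50(lengths, genome_size):
    """Return (NG50Length, LG50Count) for a list of contig/scaffold lengths.

    NG50Length: length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half the genome size.
    LG50Count: the number of sequences in that set.
    """
    half = genome_size / 2
    running = 0
    # Walk through sequences from longest to shortest, accumulating length.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count
    # Assumption: an assembly covering less than half the genome gets 0 / -1,
    # matching the chrIIIdup row of the table (not taken from BUSCOMP's code).
    return 0, -1


# Toy example: five sequences against a genome size of 20.
print(ng50_lg50([5, 4, 3, 2, 1], 20))
```

Passing the total assembly length as `genome_size` gives the ordinary N50/L50 instead, which is why the two sets of values coincide only when the assembly is the same size as the genome.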
NOTE: To modify these plots and tables, edit the *.genomes.tdt and *.NxLxxIDxx.rdata.tdt files and re-knit the *.NxLxxIDxx.Rmd file.
Compiled BUSCO results for 4 assemblies and 2 groups have been saved in yeast.genomes.tdt. BUSCO ratings are defined (quoting from the BUSCO v3 User Guide) as:
* `Complete`: Single-copy hits where “BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile.”
* `Duplicated`: As Complete but 2+ copies.
* `Fragmented`: “BUSCO matches … within the range of scores but not within the range of length alignments to the BUSCO profile.”
* `Missing`: “Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile.”
| Genome | N | Complete | Single | Duplicated | Fragmented | Missing |
|---|---|---|---|---|---|---|
| SGD | 1711 | 1683 | 1672 | 11 | 12 | 16 |
| PacBioHQ | 1711 | 1684 | 1673 | 11 | 10 | 17 |
| chrIIIdup | 1711 | 45 | 0 | 45 | 0 | 1666 |
| HighQuality | 1711 | 1684 | 1673 | 11 | 11 | 16 |
| PacBioWTDBG2 | 1711 | 1384 | 1375 | 9 | 135 | 192 |
| BUSCOMP | 1711 | 1689 | 1681 | 8 | 9 | 13 |
BUSCOMP compiled the following groups of genomes (where BUSCO data was loaded), keeping the “best” rating for each BUSCO gene across the group:
* HighQuality: SGD PacBioHQ chrIIIdup
* BUSCOMP: SGD PacBioHQ chrIIIdup PacBioWTDBG2
SGD BUSCO Results:
C:98.4%[S:97.7%,D:0.6%],F:0.7%,M:0.9%,n:1711
PacBioHQ BUSCO Results:
C:98.4%[S:97.8%,D:0.6%],F:0.6%,M:1.0%,n:1711
chrIIIdup BUSCO Results:
C:2.6%[S:0.0%,D:2.6%],F:0.0%,M:97.4%,n:1711
HighQuality BUSCO Results:
C:98.4%[S:97.8%,D:0.6%],F:0.6%,M:0.9%,n:1711
PacBioWTDBG2 BUSCO Results:
C:80.9%[S:80.4%,D:0.5%],F:7.9%,M:11.2%,n:1711
BUSCOMP BUSCO Results:
C:98.7%[S:98.2%,D:0.5%],F:0.5%,M:0.8%,n:1711
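As a rough illustration of the group compilation and one-line summaries above, the following Python sketch keeps the best rating per gene across a group and formats counts in the BUSCO style. The function names are hypothetical, and the rating priority (single-copy Complete, then Duplicated, Fragmented, Missing) is an assumption inferred from the compiled table rather than taken from BUSCOMP's code.

```python
# Assumed priority, best first (an inference from the table, not BUSCOMP's code).
# "Complete" here means a single-copy complete hit.
PRIORITY = ["Complete", "Duplicated", "Fragmented", "Missing"]


def compile_group(ratings_by_genome):
    """Map {genome: {gene: rating}} to {gene: best rating across genomes}."""
    best = {}
    for ratings in ratings_by_genome.values():
        for gene, rating in ratings.items():
            if gene not in best or PRIORITY.index(rating) < PRIORITY.index(best[gene]):
                best[gene] = rating
    return best


def busco_summary(single, duplicated, fragmented, missing):
    """Format rating counts as a BUSCO-style one-line summary."""
    n = single + duplicated + fragmented + missing

    def pct(x):
        return f"{100 * x / n:.1f}%"

    return (f"C:{pct(single + duplicated)}[S:{pct(single)},D:{pct(duplicated)}],"
            f"F:{pct(fragmented)},M:{pct(missing)},n:{n}")


# SGD counts from the table above: 1672 Single, 11 Duplicated, 12 Fragmented, 16 Missing.
print(busco_summary(1672, 11, 12, 16))
```

With the SGD counts this reproduces the SGD summary line shown above.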
Full BUSCO results with ratings for each gene have been compiled in yeast.busco.tdt:
The best complete BUSCO hit results (based on Score and Length) have been compiled in yeast.buscoseq.tdt. The Genome field indicates the assembly with the best hit, which is followed by details of that hit (Contig, Start, End, Score, Length). BUSCOMP ratings for each assembly are then given in subsequent fields:
* `Identical`: 100% coverage and 100% identity in at least one contig/scaffold.
* `Complete`: 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)
* `Duplicated`: 95%+ Coverage in 2+ contigs/scaffolds.
* `Fragmented`: 95%+ combined coverage but not in any single contig/scaffold.
* `Partial`: 40-95% combined coverage.
* `Ghost`: Hits meeting local cutoff but <40% combined coverage.
* `Missing`: No hits meeting local cutoff.
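The rating rules listed above can be sketched as a small decision function. This is an illustrative reading of the list, not BUSCOMP's implementation: the function name, the input layout, and the choice to pass the combined coverage in as a precomputed value are all assumptions.

```python
def buscomp_rating(hits, combined_coverage):
    """Return (rating, identical) for one BUSCOMP gene in one assembly.

    hits: list of (coverage %, identity %) tuples, one per contig/scaffold
          with a hit meeting the local cutoff (hypothetical input layout).
    combined_coverage: % of the gene covered by all hits together, assumed
          to have been computed elsewhere.
    """
    if not hits:
        return "Missing", False
    # Contigs/scaffolds that individually cover 95%+ of the gene.
    full = [cov for cov, _ in hits if cov >= 95.0]
    # Identical is reported alongside Complete/Duplicated, not instead of it.
    identical = any(cov == 100.0 and ident == 100.0 for cov, ident in hits)
    if len(full) >= 2:
        return "Duplicated", identical
    if len(full) == 1:
        return "Complete", identical
    if combined_coverage >= 95.0:
        return "Fragmented", False
    if combined_coverage >= 40.0:
        return "Partial", False
    return "Ghost", False


# Two full-length copies, one of them a perfect match.
print(buscomp_rating([(100.0, 100.0), (97.0, 99.0)], 100.0))
```

Returning the Identical flag separately reflects the note below that Identical hits are also counted as Complete.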
BUSCOMP ratings (see above) are compiled to summary statistics in yeast.N3L20ID0U.ratings.tdt. Note that Identical ratings in this table will also be rated as Complete, which in turn are Single or Duplicated. Percentage summaries are plotted below, along with a BUSCO-style one-line summary per assembly/group.
NOTE: Group summaries do not include Identical ratings.
| X. | Genome | N | Identical | Complete | Single | Duplicated | Fragmented | Partial | Ghost | Missing |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SGD | 1681 | 1609 | 1679 | 1678 | 1 | 0 | 2 | 0 | 0 |
| 2 | PacBioHQ | 1681 | 1611 | 1679 | 1678 | 1 | 0 | 2 | 0 | 0 |
| 3 | chrIIIdup | 1681 | 43 | 45 | 0 | 45 | 0 | 1 | 1 | 1634 |
| 4 | HighQuality | 1681 | 0 | 1679 | 1678 | 1 | 0 | 2 | 0 | 0 |
| 5 | PacBioWTDBG2 | 1681 | 1018 | 1679 | 1677 | 2 | 0 | 2 | 0 | 0 |
| 6 | BUSCOMP | 1681 | 0 | 1679 | 1678 | 1 | 0 | 2 | 0 | 0 |
BUSCOMP BUSCOMP Results [1689 (98.71%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
C:99.9%[S:99.8%,D:0.1%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681
HighQuality BUSCOMP Results [1684 (98.42%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
C:99.9%[S:99.8%,D:0.1%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681
PacBioHQ BUSCOMP Results [1684 (98.42%) Complete BUSCOs; 1599 (95.12%) BUSCOMP Seqs]:
C:99.9%[S:99.8%,D:0.1%,I:95.8%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681
PacBioWTDBG2 BUSCOMP Results [1384 (80.89%) Complete BUSCOs; 77 (4.58%) BUSCOMP Seqs]:
C:99.9%[S:99.8%,D:0.1%,I:60.6%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681
SGD BUSCOMP Results [1683 (98.36%) Complete BUSCOs; 5 (0.30%) BUSCOMP Seqs]:
C:99.9%[S:99.8%,D:0.1%,I:95.7%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681
chrIIIdup BUSCOMP Results [45 (2.63%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
C:2.7%[S:0.0%,D:2.7%,I:2.6%],F:0.0%,P:0.1%,G:0.1%,M:97.2%,n:1681
Full BUSCOMP results with ratings for each gene in every assembly and group have been compiled in yeast.N3L20ID0U.buscomp.tdt:
Ratings changes from BUSCO to BUSCOMP (where NULL ratings indicate no BUSCOMP sequence):
| BUSCO | BUSCOMP | SGD | PacBioHQ | chrIIIdup | PacBioWTDBG2 | TOTAL |
|---|---|---|---|---|---|---|
| Complete | Complete | 1670 | 1671 | 0 | 1372 | 4713 |
| Complete | Duplicated | 0 | 0 | 0 | 1 | 1 |
| Complete | Partial | 2 | 2 | 0 | 2 | 6 |
| Duplicated | Complete | 2 | 2 | 0 | 1 | 5 |
| Duplicated | Duplicated | 1 | 1 | 45 | 0 | 47 |
| Duplicated | NULL | 8 | 8 | 0 | 8 | 24 |
| Fragmented | Complete | 3 | 1 | 0 | 126 | 130 |
| Fragmented | Duplicated | 0 | 0 | 0 | 1 | 1 |
| Fragmented | NULL | 9 | 9 | 0 | 8 | 26 |
| Missing | Complete | 3 | 4 | 0 | 178 | 185 |
| Missing | Ghost | 0 | 0 | 1 | 0 | 1 |
| Missing | Missing | 0 | 0 | 1634 | 0 | 1634 |
| Missing | NULL | 13 | 13 | 30 | 14 | 70 |
| Missing | Partial | 0 | 0 | 1 | 0 | 1 |
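A table of this kind can be built by cross-tabulating the two ratings per gene. The sketch below assumes each rating set is held as a `{gene: rating}` dictionary for one assembly, which is a hypothetical layout rather than the format of the BUSCOMP output files; the gene names are toy values.

```python
from collections import Counter


def rating_changes(busco, buscomp):
    """Count (BUSCO rating, BUSCOMP rating) pairs for one assembly.

    Genes with no BUSCOMP sequence are absent from `buscomp` and are
    counted with a BUSCOMP rating of NULL.
    """
    return Counter((rating, buscomp.get(gene, "NULL")) for gene, rating in busco.items())


# Toy example with three hypothetical genes.
busco = {"geneA": "Complete", "geneB": "Missing", "geneC": "Fragmented"}
buscomp = {"geneA": "Complete", "geneB": "Complete"}
for (old, new), count in sorted(rating_changes(busco, buscomp).items()):
    print(old, new, count)
```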
Full table of ratings changes by gene:
Complete, Duplicated, Fragmented, Partial, Ghost, Missing, NULL (no BUSCOMP sequence)
There is a risk that performing a low stringency search will identify homologues or pseudogenes of the desired BUSCO gene in error. If there is a second copy of a gene in the genome that is detectable by the search then we would expect the same genes that go from Missing to Complete in some genomes to go from Single to Duplicated in others.
To test this, data is reduced for each pair of genomes to BUSCO-BUSCOMP rating pairs of:
* Single-Single
* Single-Duplicated
* Missing-Missing
* Missing-Single
This is then converted into Gain ratings (Single-Duplicated & Missing-Single) or No Gain ratings (Single-Single & Missing-Missing). The Single-Duplicated shift in one genome is then used to set the expected Missing-Single shift in the other, and the probability of observing the Missing-Single shift is assessed using a cumulative binomial distribution, where:
* k is the number of observed GG pairs (Single-Duplicated and Missing-Single)
* n is the number of Missing-Single Gains in the focal genome (NG+GG)
* p is the proportion of Single-Duplicated Gains in the background genome ((GN+GG) / (GN+GG+NN+NG))
* pB is the probability of observing k+ Missing-Single gains, given p and n
This is output to *.gain.tdt, where each row is a Genome and each field gives the probability of the row genome’s Missing-Single gains, given the column genome’s Single-Duplicated gains:
| Genome | SGD | PacBioHQ | chrIIIdup | PacBioWTDBG2 |
|---|---|---|---|---|
| PacBioHQ | 1 | 1 | 1 | 1 |
| PacBioWTDBG2 | 1 | 1 | 1 | 1 |
| SGD | 1 | 1 | 1 | 1 |
| chrIIIdup | 1 | 1 | 1 | 1 |
Low probabilities indicate that BUSCOMP might be rating paralogues or pseudogenes and not functional orthologues of the BUSCO gene. Note that there is no correction for multiple testing, nor any adjustment for lack of independence between samples.
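The gain test described above can be sketched in Python as an upper-tail binomial probability. This is an illustration of the stated formula under assumptions, not BUSCOMP's implementation: the function names are hypothetical, and the pair counts are assumed to be ordered background then focal (so `gn` is a Gain in the background genome and No Gain in the focal genome).

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Probability of k or more successes in n trials with success rate p."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def gain_probability(gg, gn, ng, nn):
    """pB for one focal/background genome pair from the four pair counts.

    Assumed pair order: first letter = background genome (Single-Duplicated
    gain or not), second letter = focal genome (Missing-Single gain or not).
    """
    total = gg + gn + ng + nn
    if total == 0:
        return 1.0
    k = gg                   # genes gained in both genomes
    n = ng + gg              # Missing-Single gains in the focal genome
    p = (gn + gg) / total    # proportion of Single-Duplicated gains in the background
    return binomial_upper_tail(k, n, p)


# Toy counts: no shared gains, so observing "0 or more" is certain.
print(gain_probability(gg=0, gn=1, ng=3, nn=100))
```

Under this formulation a value of 1 is returned whenever k is 0, since observing zero or more shared gains is certain.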
BUSCO and BUSCOMP Complete ratings were compared for each BUSCO gene to identify those genes unique to either a single assembly or a group of assemblies. The BUSCOMP group is excluded from this analysis, as (typically) are other redundant groups wholly contained within another group. (Inclusion of such groups is guaranteed to result in 2+ groups containing any Complete BUSCOs they have.)
SGD unique Complete genes: 0 BUSCO; 0 BUSCOMP
PacBioHQ unique Complete genes: 1 BUSCO; 0 BUSCOMP
chrIIIdup unique Complete genes: 0 BUSCO; 0 BUSCOMP
PacBioWTDBG2 unique Complete genes: 5 BUSCO; 0 BUSCOMP
HighQuality unique Complete genes: 0 BUSCO; 0 BUSCOMP
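Counts like these can be derived with simple set arithmetic. The sketch below assumes the Complete genes for each assembly or group are available as sets of gene IDs; the function name and toy IDs are hypothetical.

```python
def unique_complete(complete_by_assembly):
    """Map {name: set of Complete gene IDs} to {name: genes Complete only there}."""
    unique = {}
    for name, genes in complete_by_assembly.items():
        # Union of Complete genes across every other assembly/group.
        others = set().union(
            *(g for other, g in complete_by_assembly.items() if other != name)
        )
        unique[name] = genes - others
    return unique


# Toy example with hypothetical gene IDs.
example = {
    "asm1": {"geneA", "geneB"},
    "asm2": {"geneB", "geneC"},
    "asm3": {"geneB"},
}
for name, genes in unique_complete(example).items():
    print(name, len(genes))
```

Including a group that wholly contains another entry would leave the contained entry with no unique genes, which is the reason given above for excluding redundant groups.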
In addition to the unique ratings (above), it can be useful to know how genes Missing from one assembly/group are rated in the others. These plots are generated for each assembly/group in turn. The full BUSCO (*.busco.tdt) and BUSCOMP (*.LnnIDxx.buscomp.tdt) tables are reduced to the subset of genes that are missing in the assembly/group of interest, and then the summary ratings recalculated for that subset.
In each case, three plots are made (assuming both BUSCO and BUSCOMP data is available):
* BUSCO ratings for Missing BUSCO genes
* BUSCOMP ratings for Missing BUSCO genes
* BUSCOMP ratings for Missing BUSCOMP genes
[Plots: these three panels were generated for each of SGD, PacBioHQ, chrIIIdup, HighQuality and PacBioWTDBG2.]
BUSCOMP V0.11.0: run Wed Mar 3 13:51:06 2021
This analysis was run in:
/Users/redwards/OneDrive - UNSW/projects/BUSCOMP-Jan19/githubdev/example/run
Log file:
/Users/redwards/OneDrive - UNSW/projects/BUSCOMP-Jan19/githubdev/example/run/yeast.log
Commandline parameters:
ini=../run/example.ini i=-1 forks=4
Full command list:
minimap2=minimap2 ini=../run/example.ini genomesize=13.1e6 genomes=../example.genomes.csv groups=../example.groups.csv runs=../busco3/run_* fastadir=../fasta/ basefile=yeast i=-1 forks=4
BUSCOMP returned no runtime errors.
See run log for further details:
#WARN 00:00:04 "Single copy" BUSCO EOG092E01WO has 2+ sequences in ../busco3/run_MBG344001/single_copy_busco_sequences/EOG092E01WO.fna! (Keeping first.)
#WARN 00:00:04 "Single copy" BUSCO EOG092E01WO has 2+ sequences in ../busco3/run_MBG344001/single_copy_busco_sequences/EOG092E01WO.faa! (Keeping first.)
#WARN 00:00:05 "Single copy" BUSCO EOG092E0EIP has 2+ sequences in ../busco3/run_MBG344WTDBG2/single_copy_busco_sequences/EOG092E0EIP.fna! (Keeping first.)
#WARN 00:00:05 "Single copy" BUSCO EOG092E0EIP has 2+ sequences in ../busco3/run_MBG344WTDBG2/single_copy_busco_sequences/EOG092E0EIP.faa! (Keeping first.)
Output generated by BUSCOMP v0.11.0 © 2019 Richard Edwards | richard.edwards@unsw.edu.au